Overview

I am doing a secondary analysis of gene expression data from tumor biopsy samples in a study of patients with rectal cancer. The dataset includes basic clinical covariates and recurrence data. I will determine what gene expression is associated with recurrence, and also perform gene set enrichment analysis to determine whether any pathways are up- or down-regulated in tumors that respond to chemoradiation compared to tumors that do not respond.

Introduction

The standard of care for stage III rectal cancer is combination chemoradiation followed by surgical resection of the primary tumor and lymph nodes. There is interest in developing predictive tools that can predict which patients will have a complete response to chemoradiation and could potentially avoid surgery, which would leave the patient with a colostomy bag.

Traditional clinical factors such as demographic variables, stage, etc are not sufficient to predict which patients will have good responses to chemoradiation, so investigators are focusing on radiomics and genomics to develop predictive models. These are collaborative studies between radiation oncologists, radiologists, and bioinformaticians. I met several times with Mengyuan Kan, who taught me about the RAVED pipeline she developed and how I can adapt it to my chosen data set.

Methods

I used publicly available gene expression and phenotype data from GEO. This study by Watanabe et al. obtained biopsies of the primary tumor for 46 patients with stage III rectal cancer, administered pre-operative chemoradiation, and then resected the primary tumors. Patients were labeled as “Responders” if they had a complete or near-complete pathological response according to a classification system by the Japanese Society for Cancer of the Colon and Rectum. There was no other clinical data included in the dataset.

QC and analysis scripts for differential expression were adapted from the RAVED pipeline PubMed.


This report shows the QC steps for gene expression microarry data from GEO study, including:

Manually define the variables used in the RAVED pipeline

# GEO id
geo_id="GSE35452" 
GPL_ID="GPL570"
# directory stores GEO data
datadir="C:/Users/jahan/Desktop/GSE35452"
# directory stores generated files
resdir="C:/Users/jahan/Desktop/GSE35452"
#platform="Affymetrix" #Try changing to Illumina
platform="Affymetrix"
geo_GPL=""
normdata=FALSE 
suppldata=TRUE
paired=FALSE
usesuppl=TRUE
# The shortname_func function shortens the sample name shown in the plots. To start, define shortname_func <- function(x){x}
shortname_func <- function(x){gsub("^(.*)\\.(cel|CEL).gz","\\1",x)}

Load the necessary libraries. Load affy and dplyr packages later since they will mask other functions.

GEO/ArrayExpress Data Download and Phenotype Preparation

GEO Dataset Download

Download GEO series matrix files if available.

Phenotype information modification

library(dplyr)
pheno.raw=pData(gse)
cols <- c("title","geo_accession","source_name_ch1","response to preoperative chemoradiotherapy:ch1")
pheno <- pheno.raw %>%
            dplyr::select(cols) %>%
            dplyr::rename(Response="response to preoperative chemoradiotherapy:ch1") %>%
            dplyr::mutate(Patient=seq(1:nrow(pheno.raw)),
                          Response=ifelse(Response=="Non-responder","Nonresponder","Responder"),
                          Disease="Rectal_cancer",
                          Tissue="Tumor",
                          Response=factor(Response),
                           GEO_ID=geo_accession)
            
            

pheno$Response[pheno$Response=="Non-responder"]=factor("Nonresponder")
pData(gse)=dplyr::rename(pData(gse),Response='response to preoperative chemoradiotherapy:ch1')
pData(gse)$Response[pData(gse)$Response=='Non-responder']='Nonresponder'
detach("package:dplyr")

Raw intensity data download and define RAVED pipeline functions

Download supplementary raw data files.

## Loading required package: pd.hg.u133.plus.2
## Loading required package: RSQLite
## Loading required package: DBI
## Platform design info loaded.
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868548_102_Rad.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868549_K_102.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868550_K_104.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868551_105_Rad.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868552_106_Rad_2.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868553_107_Rad.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868554_108_Rad.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868555_109_Rad_2.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868556_110_Rad_2.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868557_111_Rad.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868558_112_Rad.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868559_113_Rad.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868560_114_Rad.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868561_115_RCa.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868562_116_RCa.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868563_118_RCa.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868564_120_Rad_2.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868565_121_Rad.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868566_122_Rad_2.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868567_124_Rad.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868568_126_Rad.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868569_128Rad.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868570_129Rad.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868571_130Rad.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868572_131Rad.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868573_135Rad.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868574_136Rad.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868575_137Rad.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868576_143Rad.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868577_B_149.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868578_B_153.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868579_B_158.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868580_B_164.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868581_B_166.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868582_B_168.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868583_B_170.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868584_B_173.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868585_B_175.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868586_B_179.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868587_B_181.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868588_B_183.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868589_Rad_97.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868590_Rad86.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868591_Rad89.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868592_Rad91.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868593_Rad93.CEL.gz

Assign phenotype data from gse object to raw data object.

Retrieve scan date information from raw.data object for batch effect adjustment. For Affymetrix data, scan date information is imported by oligo with raw data files.

Show the summary of phenotype variables and the sample size for different groups

Sample size in different tissue and disease/Response groups
Tissue Disease Response Count
Tumor Rectal_cancer Nonresponder 22
Tumor Rectal_cancer Responder 24
Sample size in different batch
ScanDate_Group Count
2006 4
2007 22
2008 8
2009 12

Assign colors to scan date or disease/Response if scan date is not available.

If negative/zero intensity values are present, convert them to NAs.

Write the phenotype data set to .txt format for storage

Quality Control for Microarray Data

The major QC steps and scoring methods for outliers were adapted from arrayQualityMetrics. The threshold to determine an outlier used in arrayQualityMetrics is the boxplot’s upper whisker, i.e. values beyond 1.5 times the interquartile range, which is also applied to our pipeline. The following QC metrics are included in a routine analysis. The QC metrics used for outlier detection are marked with an asterisk.

Raw Probe Intensity Boxplots and Density Histograms

The log2-transformed/normalized intensity distributions of all samples (arrays) are expected to have the similar scale (i.e. the similar positions and widths of the boxes). Outlier detection is applied by computing a Kolmogorov-Smirnov statistic (Ka) between log-intensity distribution for one array and the pooled array data, where an array with a Ka beyond the upper whisker is designated as an outlier.

  1. Outlier detection for log2 raw probe intensity/normalized intensity

Compute the Kolmogorov-Smirnov statistic Ka between each array’s (i.e. sample) values (i.e. log2 transformed raw probe intensity values) and the pooled, overall distribution of the values.

## 0 outlier(s) are detected in the raw intensity metrics.
  1. Boxplots for log2 raw probe intensity

  1. Density curves for log2 raw probe intensity

The intensity curves of all samples (arrays) are expected to have the similar shapes and ranges. Samples with deviated curves are likely to have problematic experiments. For example, high levels of background will shift an array’s distribution to the right. Lack of signal diminishes its right right tail. A bulge at the upper end of the intensity range often indicates signal saturation.

RNA Digestion

Overall RNA quality can be assessed by RNA degradation plots. In the gene expression array, each probe is represented by a probe set. Each probe set is 11-20 probes (pairs of oligos). This plot shows the average intensity of each probe across all probe sets, ordered from the 5’ to the 3’ end. It is expected that probe intensities are lower at the 5’ end of a probe set when compared to the 3’ end as RNA degradation starts from the 5’ end of a molecule. RNA which is too degraded will shows a very high slope from 5’ to 3’. Thus, the standardized slope of the RNA degradation plot serves as quantitative indicator of the RNA degradation.

Distribution of Perfect Match (PM) and Mismatch (MM)

There are two paired probe types: perfect match (PM) and of mismatched (MM) probes. A PM probe matches a strand of cDNA, while the corresponding MM probe differs from the PM by a change in the central nucleotide. A probe set is called present if the intensity value of PM is significantly larger than MM. However, the Affymetrix approach is under attack because between 15%-30% of the MM are greater than the PM. For some newer arrays, MM probes are not used. If the number of PMs is not equal to that of MMs, this might be a PM-only array.

If both PM and MM are present, the density curves of log2 PM and MM intensities are generated, where MM probes are expected to have smaller log2-intensity at the peak than PM probes due to their nonspecific hybridization.

MA Plots

MA plots allow pairwise comparison of the log-intensity of each array to a reference array and identification of intensity-dependent biases.

The y-axis of the MA-plot shows the log-ratio intensity of one array to the reference median array, which is called M (minus). M = log2(I1)-log2(I2) (I1: the intensity of the array studied; I2: the median intensity across arrays)

The x-axis indicates the average log-intensity of both arrays, which is called A (add). A = 1/2*(log2(I1)+log2(I2))

It is expected that the probe levels do not differ systematically within a group of replicates, so that the MA-plot is centered on the y-axis (y=0 or M=0) from low to high intensities.

  1. Outlier detection for MA plots

Outlier detection is applied by computing a Hoeffding’s statistic (Da) on the joint distribution of A and M for each array, where an array with a Da >0.15 is designated as an outlier.

## 0 outliers are detected in the MA metrics.
  1. MA plots

MA plots of the samples with the 4 highest and lowest Hoeffding’s statistics.

Spatial Distribution

Spatial plots show an artificial colored image of an array’s spatial distribution of intensities that indicate spatial variation in an array. Log-intensities of probes are plotted by their corresponding spatial x and y-coordinate in the array and are expected to be uniformly distributed if the array data has good quality. The rank scale is applied for plotting as it has the potential to amplify patterns that are small in amplitude but systematic within an array.

  1. Outlier detection for spatial plots

Outlier detection is applied by computing a sum of the absolute values of low frequency Fourier coefficients (Fa) across all probe sets for each array, where an array with a Fa beyond the upper whisker is designated as an outlier.

## 0 outlier(s) are detected in the spatial metrics.
  1. Spatial distribution plots

Spatial distribution plots of samples with the 4 highest and lowest values of Fa. The value of Fa for each sample is shown in the panel headings. Outliers marked with * have Fa values of large scale spatial structures.

Relative Log Expression (RLE) Distribution

The normalized unscaled standard error (NUSE) and relative log expression (RLE) boxplots indicate probe set homogeneity in one array, where the metrics are derived from a fitted probe level model by the fitProbeLevelModel function (oligo). The RLE plots represent the distribution of the ratio between the log-intensity of a probe set and the median log-intensity of the corresponding probe set across all arrays, expected to be centered near 0, as a log scale is applied. Outlier detection is applied by computing a Kolmogorov-Smirnov statistic (Ra) between RLE distribution for one array and the pooled array data, where an array with a Ra beyond the upper whisker is designated as an outlier

  1. Outlier detection for RLE

Compute the Kolmogorov-Smirnov statistic Ra between each array’s (i.e. sample) values (i.e. relative log expression values) and the pooled, overall distribution of the values. Detect outliers that are deviated from the threshold.

  1. Boxplot for RLE

Use boxplot_func function to plot RLE. Outliers marked with * have values centered away from 0 and/or are more spread out are potentially problematic.

Normalized Unscaled Standard Error (NUSE) Outlier Detection and Plots

The NUSE plots show the distribution of normalized standard error estimates, expected to be centered near 1. Outlier detection is applied by computing an upper hinge (Na) across all probe sets for each array, where an array with a Na beyond the upper whisker is designated as an outlier.

  1. Outlier detection for NUSE

Compute 75% quantile Na of each array’s NUSE values Detect outliers that have larger Na deviated from the threshold.

3 outlier(s) are detected in NUSE metrics.

  1. Boxplot for NUSE

Use boxplot_func function to plot RLE. Outliers marked with * have values centered away from 0 and/or are more spread out are potentially problematic.

Distance between Samples and Outlier Detection

Distance between arrays is evaluated using mean absolute difference of log-intensity/normalized intensity between each pair of arrays, where the hierarchical tree between arrays is created based on the distance, which is visualized by a heatmap and dendrogram.

The distance d(ab) between two arrays a and b is computed as the mean absolute difference (L1-distance) between the data of the arrays (using the data from all probes without filtering). In the formula (the dist2 function from genefilter package), d(ab) = mean | M(ai) - M(bi) |, where M(ai) is the value of the i-th probe on the a-th array.

  1. Outlier detection for sample distance

Outlier detection is applied by computing the sum of the distances of one array to all other arrays (Sa) (Sa=Sum(b)d(ab)), where an array with a Sa beyond the upper whisker is designated as an outlier.

## 0 outlier(s) are detected in sample distance metrics.
  1. Plot distance between samples

Principal Component Analysis (PCA)

PCA demonstrates information of the expression dataset in a reduced number of dimensions. Clustering and PCA plots enable to assess to what extent arrays resemble each other, and whether this corresponds to the known resemblances of the samples.

  1. Compute PCs and variance explained by the first 10 PCs
Variance explained
PC Proportion of Variance (%) Cumulative Proportion of Variance (%)
PC1 42.83 42.83
PC2 10.28 53.11
PC3 4.018 57.13
PC4 3.234 60.36
PC5 2.76 63.12
PC6 2.32 65.44
PC7 1.748 67.19
PC8 1.581 68.77
PC9 1.348 70.12
PC10 1.315 71.44
  1. PCA plots

PCA plots are generated using the first two principle components colored by known factors (e.g. Response/disease conditions, tissue, Patients and scan dates), visualizing similarities between arrays and these similarities’ correlation to batch effects.

QC Summary

The summary of outliers and detection methods

Outlier Summary
  Frequency Method
GSM868591_Rad89.CEL.gz 1 NUSE
GSM868592_Rad91.CEL.gz 1 NUSE
GSM868593_Rad93.CEL.gz 1 NUSE
GSM868554_108_Rad.CEL.gz 1 RLE
GSM868562_116_RCa.CEL.gz 1 RLE

Create a new column “QC_Pass” in the phenotype file. By default, samples detected as an outlier more than twice are assigned to 0 otherwise to 1.

## 0 outlier(s) are defined.

Results

QC identified no significant outlier samples that needed to be discarded. Description of the analysis findings will precede figures and tables below.

Mannually assign variables used in the RAVED pipeline

tissue="Tumor" # naming is rigid, should be same as the Tissue defined in QC
# reference condition
con0="Nonresponder" # naming is rigid, should be same as the Response defined in QC
# altered condition
con1="Responder" # naming is rigid, should be same as the Response defined in QC
# Response. Assign "comparison" if this column is used for DE.
Response=c(con0,con1)
# disease. Assign "comparison" if this column is used for DE.
disease="Rectal_cancer"

pheno_fn=paste0(resdir,"/",geo_id,"_Phenotype_withQC.txt")

Obtain Phenotype and GEO Data

Phenotype data preparation

Read in pre-prepared post-QC phenotype data

Subset phenotypes based on the comparison variables. Check variables (before and after QC if outliers are detected).

Summary of subsetted samples
ScanDate_Group Status Counts
2006 Nonresponder 2
2007 Nonresponder 11
2008 Nonresponder 4
2009 Nonresponder 5
2006 Responder 2
2007 Responder 11
2008 Responder 4
2009 Responder 7
Summary of the comparison
GEO_ID GSE35452
Tissue Tumor
App Response
Disease Rectal_cancer
Response Responder
N_Condition0 22
N_Condition1 24
Total 46
Unique_ID GSE35452_Tumor_Rectal_cancer_Responder_vs_Nonresponder

Assign colours to status and scan date (if available)

Gene expression data preparation

Obtain raw.data.of.interest by subsetting raw.data based on the phenotype of interest

Differential Gene Expression Analysis

Normalize raw gene expression data

Normalize gene expression raw data using robust multi-array average (RMA) method

If negative/zero intensity values are present, convert them to NAs.

Comparison of Responders vs. Non-Responders

Fit a linear model to RMA log-intensity values, fit this model to a contrast matrix for the comparison of interest, and apply empirical Bayes smoothing to obtain more precise standard errors.

A design model matrix for linear model
Sample Nonresponder Responder
GSM868548_102_Rad.CEL.gz 1 0
GSM868549_K_102.CEL.gz 1 0
GSM868550_K_104.CEL.gz 1 0
GSM868551_105_Rad.CEL.gz 1 0
GSM868552_106_Rad_2.CEL.gz 1 0
GSM868553_107_Rad.CEL.gz 0 1
GSM868554_108_Rad.CEL.gz 0 1
GSM868555_109_Rad_2.CEL.gz 0 1
GSM868556_110_Rad_2.CEL.gz 0 1
GSM868557_111_Rad.CEL.gz 1 0
GSM868558_112_Rad.CEL.gz 0 1
GSM868559_113_Rad.CEL.gz 0 1
GSM868560_114_Rad.CEL.gz 0 1
GSM868561_115_RCa.CEL.gz 1 0
GSM868562_116_RCa.CEL.gz 0 1
GSM868563_118_RCa.CEL.gz 1 0
GSM868564_120_Rad_2.CEL.gz 1 0
GSM868565_121_Rad.CEL.gz 1 0
GSM868566_122_Rad_2.CEL.gz 1 0
GSM868567_124_Rad.CEL.gz 0 1
GSM868568_126_Rad.CEL.gz 0 1
GSM868569_128Rad.CEL.gz 1 0
GSM868570_129Rad.CEL.gz 0 1
GSM868571_130Rad.CEL.gz 1 0
GSM868572_131Rad.CEL.gz 0 1
GSM868573_135Rad.CEL.gz 0 1
GSM868574_136Rad.CEL.gz 1 0
GSM868575_137Rad.CEL.gz 0 1
GSM868576_143Rad.CEL.gz 1 0
GSM868577_B_149.CEL.gz 0 1
GSM868578_B_153.CEL.gz 0 1
GSM868579_B_158.CEL.gz 1 0
GSM868580_B_164.CEL.gz 0 1
GSM868581_B_166.CEL.gz 0 1
GSM868582_B_168.CEL.gz 1 0
GSM868583_B_170.CEL.gz 1 0
GSM868584_B_173.CEL.gz 1 0
GSM868585_B_175.CEL.gz 0 1
GSM868586_B_179.CEL.gz 0 1
GSM868587_B_181.CEL.gz 1 0
GSM868588_B_183.CEL.gz 0 1
GSM868589_Rad_97.CEL.gz 0 1
GSM868590_Rad86.CEL.gz 1 0
GSM868591_Rad89.CEL.gz 1 0
GSM868592_Rad91.CEL.gz 0 1
GSM868593_Rad93.CEL.gz 0 1
A contract matrix for comparison
  Response
Nonresponder -1
Responder 1

Adjusting for Batch Effect

Check whether to adjust for scan date and donor. Create a full model that includes all variables and a null model that only includes the batch variable. The batch used for adjustment:

## batcherror = no

Compute F statistic p-values adjusted for batch effect. Q-values are obtained by the Benjamini-Hochberg method. If the batch and the status are correlated, assign NA to the batch adjusted p- and q-values. If there is no batch variable or only one batch, assign p- and q-values computed by limma to the batch adjusted p-values

Assign Official Gene Symbol

Annotate official gene symbol to probes.

## The platform used is pd.hg.u133.plus.2
## The corresponding R annotation database package is hgu133plus2.db

Gene Expression Results Visualization

Volcano Plots

After adjusting p-values for multiple testing, the differential expression analysis showed no significant genes associated with response to chemoradiation as displayed in the volcano plots, histograms, and heatmaps.

Histograms

Histograms of p-value distributions

Top 200 Differentially Expression Results

res=res[!is.na(res$SYMBOL),]

Show top 200 probes sorted by un-adjusted p-values

Boxplots for Top 6 Differentially Expressed Genes

The boxplots for the top differentially expressed genes do show differential expression but after adjustment of the p values none of these were significant.

Heatmap for Top 200 Differentially Expressed Genes based on normalized raw data

Create plotting function

Phylogeny of genes by Response. This shows that the top 200 genes do not show striking clustering according to response to chemoradiation.

Phylogeny of genes by Scan year further confirms that no significant batch effect was present.

Heatmap for Top 200 Differentially Expressed Genes based on matrix data

Having created heatmaps using the normalized raw data, I wanted to create one based on the “matrix” data uploaded by the study authors (which is already normalized)

Fit a model using matrix data

#fit with matrix data
design.gse <- model.matrix(~ -1 + factor(gse$Response))
colnames(design.gse) <- levels(factor(gse$Response))

#Fit a linear model with limma package. Expression data linked to outcome of a design matrix model
fit.gse <- lmFit(gse, design.gse)
#Create contrast groups of interest
gse.contrast <- makeContrasts(Response=Responder - Nonresponder, 
                                  levels = design.gse)
#Get the contrasts for samples of interest
fit.gse2 <- contrasts.fit(fit.gse, gse.contrast)
#Adjust fit coefficients using an empirical Bayes moderation of standard errors
fit.gse2 <- eBayes(fit.gse2)

#Results for each hypothesis test can be extracted using
treatment_results_matrix <- topTable(fit.gse2, coef = "Response", adjust = "BH", num = 200)
col.sel=c("logFC","AveExpr","t","P.Value","adj.P.Val","B","Gene.Symbol") # select the columns of DE results
treatment_results_matrix<-treatment_results_matrix[treatment_results_matrix$Gene.Symbol!='',]
head(treatment_results_matrix[,col.sel])
##                  logFC     AveExpr         t      P.Value adj.P.Val         B
## 207763_at   -1.7191568  0.41139719 -4.668236 2.572281e-05 0.9271418 -3.053453
## 239713_at    1.3627552 -0.23920733  4.367207 6.910754e-05 0.9271418 -3.205459
## 235498_at    1.1391716 -0.16818457  4.254527 9.947878e-05 0.9271418 -3.262589
## 1564202_at   1.1650943 -0.38503094  4.224529 1.095471e-04 0.9271418 -3.277806
## 214727_at    0.7825310 -0.04256456  4.186795 1.236280e-04 0.9271418 -3.296950
## 232077_s_at -0.8883044 -0.27810220 -4.158492 1.353300e-04 0.9271418 -3.311310
##              Gene.Symbol
## 207763_at         S100A5
## 239713_at          CASC2
## 235498_at         LRRIQ3
## 1564202_at  LOC100131756
## 214727_at          BRCA2
## 232077_s_at        YPEL3

A heatmap using the authors’ uploaded normalized data shows almost perfect clustering according to radiation response. It is difficult to tell from their methods how exactly they normalized their data. This further confirms the utility of having raw data available to perform a high quality re-analysis.

Principal Component Analysis (PCA) based on normalized raw data

  1. Compute PCs and variance explained by the first 10 PCs
Variance explained
PC Proportion of Variance (%) Cumulative Proportion of Variance (%)
PC1 10.33 10.33
PC2 8.994 19.33
PC3 7.826 27.15
PC4 7.527 34.68
PC5 4.152 38.83
PC6 3.388 42.22
PC7 3.095 45.32
PC8 2.695 48.01
PC9 2.619 50.63
PC10 2.459 53.09
  1. PCA plots

PCA plots are generated using the first two principle components. The plot shows no significant clustering according to Response to chemoradiation.

Principal Component Analysis (PCA) based on matrix data

Similar to the heatmaps, I wanted to plot a PCA based on the authors’ normalized data to compare it to my analysis of the normalized raw data.

  1. Compute PCs and variance explained by the first 10 PCs
Variance explained
PC Proportion of Variance (%) Cumulative Proportion of Variance (%)
PC1 8.945 8.945
PC2 8.019 16.96
PC3 6.126 23.09
PC4 4.291 27.38
PC5 3.362 30.74
PC6 3.15 33.89
PC7 2.888 36.78
PC8 2.718 39.5
PC9 2.496 42
PC10 2.409 44.4
  1. PCA plots

PCA plots are generated using the first two principle components. Similar to the PCA of the normalized raw data, these do not show significant clustering. However this is in contrast to the heiarchical clustering and perhaps reflects the different methodology these techniques use to identify clusters.

###Gene Set Enrichment Analysis (GSEA)

The previous analyses examined the significance of individual genes. GSEA examines entire pathways and results in enrichment scores that suggest whether a pathway is upregulated, down-regulated, or not differentially expressed. Based on my subject matter knowledge of radiation oncology, before this analysis I hypothesized that I would see over-expression of cell cycle pathways and down-regulation of DNA double strand break repair in the Responding patients compared to non-responding. This is because radiotherapy exerts its anti-cancer effect partially through DNA double strand breaks, and impaired ability to repair these breaks could lead to more cell kill. It is also known that cells actively going through the cell cycle are much more radiosensitive than cells that are quiescent.

Load packages and import differential expression analysis data set generated earlier

library(fgsea)
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:AnnotationDbi':
## 
##     select
## The following object is masked from 'package:nlme':
## 
##     collapse
## The following objects are masked from 'package:Hmisc':
## 
##     src, summarize
## The following object is masked from 'package:oligo':
## 
##     summarize
## The following objects are masked from 'package:Biostrings':
## 
##     collapse, intersect, setdiff, setequal, union
## The following object is masked from 'package:XVector':
## 
##     slice
## The following objects are masked from 'package:IRanges':
## 
##     collapse, desc, intersect, setdiff, slice, union
## The following objects are masked from 'package:S4Vectors':
## 
##     first, intersect, rename, setdiff, setequal, union
## The following object is masked from 'package:Biobase':
## 
##     combine
## The following objects are masked from 'package:BiocGenerics':
## 
##     combine, intersect, setdiff, union
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(pander)

dir="C:/Users/jahan/Documents/BMIN503_Final_Project/fgsea/"


res_gsea=read.csv("C:/Users/jahan/Desktop/GSE35452/GSE35452_Tumor_Rectal_cancer_Responder_vs_Nonresponder.csv")

res_modi <- res_gsea %>%
  dplyr::mutate(adj.P.Val=P.Value, Gene=SYMBOL) %>%
  dplyr::filter(!is.na(Gene)) %>% # remove probes cannot be mapped to genes
  dplyr::filter(!is.na(P.Value)) %>% # remove probes with NA p.value
  dplyr::arrange(Gene, P.Value,-abs(logFC)) %>% # order by gene name, p value and descending absolute logFC values
  dplyr::group_by(Gene) %>% # group by gene name
  dplyr::filter(row_number()==1) %>% # select first row in each gene
  dplyr::ungroup() %>%
  dplyr::rename(stat=t) %>% # rename t value to stat
  dplyr::select(Gene,stat) %>% # select column names
  dplyr::arrange(stat) %>%
  as.data.frame()
gene_stat <- res_modi$stat
names(gene_stat) <- res_modi $Gene


pathways.msigkeggreac=readRDS(file=paste0(dir,"/","MSigkeggreactome_list.RDS"))

Create FGSEA data set

fgsea_res=fgsea(pathways= pathways.msigkeggreac,stats= gene_stat,minSize=15,maxSize=500,nperm=10000,gseaParam=1)
fgsea_res <- fgsea_res[order(-NES),]

collapsedPathways <- collapsePathways(fgseaRes= fgsea_res, pathways= pathways.msigkeggreac, stats= gene_stat, gseaParam=1)
mainPathways = collapsedPathways$mainPathways

fgsea_res$enriched=ifelse(fgsea_res$NES>0,"Up-regulated","Down-regulated")

fgsea_res_collapse=fgsea_res[fgsea_res$pathway %in% mainPathways,]

Plot lollipop chart of main Pathways in FGSEA. This shows that cell cycle pathways are indeed up-regulated, but contrary to my hypothesis DNA repair pathways are also up-regulated. After examining these results my interpretation is that DNA repair pathways are up-regulated because the cells in Responding patients are more actively cycling than cells in Nonresponding patients. Thus their tumors are radiosensitive because they are cycling and this is in spite of the fact that they are overexpressing DNA repair pathways.

Among pathways that are down-regulated, I don’t see any that to my knowledge are associated with response to radiotherapy.

ggplot(fgsea_res_collapse, aes(x=NES,y=reorder(pathway,NES),color=enriched)) +
        geom_segment(aes(x = 0, y=reorder(pathway,NES), xend = NES, yend = pathway), color = "grey50") +
        geom_point() +xlab("Normalized Enrichment Score") + ylab("REACTOME/KEGG Pathway")

Plot enrichment pathways

plotEnrichment(pathway = pathways.msigkeggreac[["REACTOME_DNA_DOUBLE_STRAND_BREAK_REPAIR"]], stats = gene_stat) + labs(title="REACTOME_DNA_DOUBLE_STRAND_BREAK_REPAIR")

plotEnrichment(pathway = pathways.msigkeggreac[["REACTOME_DNA_REPLICATION"]], stats = gene_stat) + labs(title="REACTOME_DNA_REPLICATION")

plotEnrichment(pathway = pathways.msigkeggreac[["KEGG_CELL_CYCLE"]], stats = gene_stat) + labs(title="KEGG_CELL_CYCLE")

Conclusion

I analyzed gene expression array data to determine whether any genes were differentially expressed in patients who responded to chemoradiation compared to those who did not respond. I was not able to identify any individual genes that were differentially expressed after adjustment for multiple testing. However on gene set enrichment analysis I identified several pathways that were up-regulated in Responders compared to Non-responders. The up-regulated pathways that were most striking were for cell cycle and DNA repair pathways. Cells that are actively cycling are known to be more radioresponsive. Cells with deficient DNA repair are also more radiosensitive, and I believe that DNA repair pathways were upregulated in Responders because their tumors were actively cycling and thus required DNA repair machinery to be available.

Session information

R version 3.6.1 (2019-07-05)

Platform: x86_64-w64-mingw32/x64 (64-bit)

locale: LC_COLLATE=English_United States.1252, LC_CTYPE=English_United States.1252, LC_MONETARY=English_United States.1252, LC_NUMERIC=C and LC_TIME=English_United States.1252

attached base packages: stats4, parallel, stats, graphics, grDevices, utils, datasets, methods and base

other attached packages: dplyr(v.0.8.3), hgu133plus2.db(v.3.2.3), org.Hs.eg.db(v.3.8.2), hgu133plus2cdf(v.2.18.0), pd.hg.u133.plus.2(v.3.12.0), DBI(v.1.0.0), RSQLite(v.2.1.2), fgsea(v.1.10.1), Rcpp(v.1.0.3), DT(v.0.10), annotate(v.1.62.0), XML(v.3.98-1.20), AnnotationDbi(v.1.46.1), sva(v.3.32.1), BiocParallel(v.1.18.1), genefilter(v.1.66.0), mgcv(v.1.8-28), nlme(v.3.1-140), limma(v.3.40.6), pander(v.0.6.3), preprocessCore(v.1.46.0), devtools(v.2.2.1), usethis(v.1.5.1), Hmisc(v.4.3-0), Formula(v.1.2-3), survival(v.2.44-1.1), lattice(v.0.20-38), gplots(v.3.0.1.1), ggplot2(v.3.2.1), viridis(v.0.5.1), viridisLite(v.0.3.0), oligo(v.1.48.0), Biostrings(v.2.52.0), XVector(v.0.24.0), IRanges(v.2.18.3), S4Vectors(v.0.22.1), oligoClasses(v.1.46.0), knitr(v.1.26), GEOquery(v.2.52.0), Biobase(v.2.44.0) and BiocGenerics(v.0.30.0)

loaded via a namespace (and not attached): snow(v.0.4-3), backports(v.1.1.5), fastmatch(v.1.1-0), lazyeval(v.0.2.2), splines(v.3.6.1), crosstalk(v.1.0.0), GenomeInfoDb(v.1.20.0), digest(v.0.6.22), foreach(v.1.4.7), htmltools(v.0.4.0), gdata(v.2.18.0), magrittr(v.1.5), checkmate(v.1.9.4), memoise(v.1.1.0), cluster(v.2.1.0), remotes(v.2.1.0), readr(v.1.3.1), matrixStats(v.0.55.0), prettyunits(v.1.0.2), colorspace(v.1.4-1), blob(v.1.2.0), xfun(v.0.11), jsonlite(v.1.6), callr(v.3.3.2), crayon(v.1.3.4), RCurl(v.1.95-4.12), zeallot(v.0.1.0), iterators(v.1.0.12), glue(v.1.3.1), gtable(v.0.3.0), zlibbioc(v.1.30.0), DelayedArray(v.0.10.0), pkgbuild(v.1.0.6), scales(v.1.0.0), xtable(v.1.8-4), htmlTable(v.1.13.2), foreign(v.0.8-71), bit(v.1.1-14), htmlwidgets(v.1.5.1), RColorBrewer(v.1.1-2), acepack(v.1.4.1), ellipsis(v.0.3.0), ff(v.2.2-14), pkgconfig(v.2.0.3), nnet(v.7.3-12), later(v.1.0.0), tidyselect(v.0.2.5), labeling(v.0.3), rlang(v.0.4.1), munsell(v.0.5.0), tools(v.3.6.1), cli(v.1.1.0), fastmap(v.1.0.1), evaluate(v.0.14), stringr(v.1.4.0), yaml(v.2.2.0), processx(v.3.4.1), bit64(v.0.9-7), fs(v.1.3.1), caTools(v.1.17.1.2), purrr(v.0.3.3), mime(v.0.7), xml2(v.1.2.2), compiler(v.3.6.1), rstudioapi(v.0.10), testthat(v.2.3.0), affyio(v.1.54.0), tibble(v.2.1.3), stringi(v.1.4.3), ps(v.1.3.0), desc(v.1.2.0), Matrix(v.1.2-17), vctrs(v.0.2.0), pillar(v.1.4.2), lifecycle(v.0.1.0), BiocManager(v.1.30.10), data.table(v.1.12.6), bitops(v.1.0-6), httpuv(v.1.5.2), GenomicRanges(v.1.36.1), R6(v.2.4.1), latticeExtra(v.0.6-28), promises(v.1.1.0), KernSmooth(v.2.23-15), gridExtra(v.2.3), affxparser(v.1.56.0), sessioninfo(v.1.1.1), codetools(v.0.2-16), gtools(v.3.8.1), assertthat(v.0.2.1), pkgload(v.1.0.2), SummarizedExperiment(v.1.14.1), rprojroot(v.1.3-2), withr(v.2.1.2), GenomeInfoDbData(v.1.2.1), hms(v.0.5.2), grid(v.3.6.1), rpart(v.4.1-15), tidyr(v.1.0.0), rmarkdown(v.1.17), shiny(v.1.4.0) and base64enc(v.0.1-3)